often follow questionable recommendations and adopt advice poorly supported by scientific
data. The key goal of the present work is to explore the idea that Twitter, as a highly popular 
platform for information exchange, could be used as a data-mining source to learn about 
the population affected by ASD - their behaviour, concerns, needs etc. To this end, using 
a large data set of over 11 million harvested tweets as the basis for our investigation, we 
describe a series of experiments which examine a range of linguistic and semantic aspects 
of messages posted by individuals interested in ASD. Our findings, the first of their nature 
in the published scientific literature, strongly motivate additional research on this topic and 
present a methodological basis for further work. 

1 Introduction

In this paper we are interested in leveraging the remarkable increase in the use of social
media, and Twitter in particular, to obtain information about specific demographic communities
which are difficult to reach by conventional means. To illustrate our ideas and demonstrate
their effectiveness, herein we focus on the population affected by the autism spectrum
disorder (ASD), a neurodevelopmental disorder which has been attracting increasing attention
as much for its complex and varied aetiology as for the associated and rapidly growing
socioeconomic burden. The population of interest in this work includes both individuals who
have been diagnosed with ASD themselves and those who are indirectly but significantly
impacted, such as the family members and carers of ASD sufferers. To the best of our
knowledge ours is the first work of this nature. The present paper builds on the preliminary
results reported in [1] and describes a number of novel findings and experiments whose
results further support the key underlying idea and motivate additional work in this direction.
The most notable novelties include:

- additional statistics of the collected tweet corpus, 

- quantitative analysis of Zipf’s law for tweets and ASD-related tweets in particular, 

- examination of statistical significance of differential part-of-speech characteristics, 

- experiments using least absolute shrinkage and selection operator (LASSO), 

- additional evidence of content saliency using an Alzheimer’s disease tweet corpus, and 

- new analysis of the procedure for the selection of the bootstrap keyword set. 

It should also be noted that the results reported in this paper are based on a significantly
expanded corpus of tweets (nearly an order of magnitude greater) as we continued collecting
data following the publication of our original work.

The remainder of this paper is structured as follows. In the next section we describe 
some of the key aspects of the autism spectrum disorder which motivated our focus on this 
particular condition. In Section 3 we review relevant previous work on the use of Twitter for
data mining and highlight the key techniques and methods in the field. Section 4 describes
the main contribution of our work: we start by describing the manner in which the data used
for the present study was collected and follow up with a description, results, and discussion
of a sequence of increasingly complex experiments. A summary of the work and its findings
is given in Section 5.

2 Motivation and relevant background 

Autism spectrum disorder is a life-long neurodevelopmental disorder with poorly understood
causes on the one hand, and a wide range of potential treatments supported by little
evidence on the other. The disorder is characterized by severe impairments in social
interaction, communication, and in some cases cognitive abilities, and typically begins in
infancy or at the very latest by the age of three. ASD is recognized as comprising an
aetiologically and clinically heterogeneous group of conditions whose diagnosis remains
based solely on the complex behavioural phenotype [2]. According to the definition in the
latest version (5th edition) of the Diagnostic and Statistical Manual of Mental Disorders, the
autism spectrum disorder includes disorders which were previously diagnosed with more
specificity as autism, Asperger syndrome, Rett syndrome, childhood disintegrative disorder,
and ‘pervasive developmental disorder not otherwise specified’ [3]. Current evidence suggests
that approximately 0.5-0.6% of the population is afflicted by ASD though the actual diagnosis
rate is on the increase due to the broadening diagnostic criteria [4]. The condition is usually
detected in early childhood when an abnormal lack of social reciprocity is observed.

Considering the social and economic burden of ASD it is unsurprising that it has been
attracting an increasing amount of research attention; numerous longitudinal,
epidemiological, and family studies have been conducted [5]. This is illustrated by the plot
in Fig. 1 which shows the number of academic papers in English with the word “autism” in the
title or abstract (as indexed by the PubMed portal which interfaces the US National Library
of Medicine life and biomedical sciences database). The earliest work is that by Kanner
in 1946 [6] with a readily observed exponential increase thenceforth. Note that the count
for the year 2014, which seemingly bucks the trend, is in fact consistent with the overall
increase as only papers indexed by PubMed up to and including May 2014 are included.



Fig. 1: The number of academic papers in English with the word “autism” in the title or 
abstract as indexed by the PubMed portal which interfaces the US National Library of 
Medicine life and biomedical sciences database. The earliest work is that by Kanner in 
1946 [6]; thereafter an exponential increase is readily observed. Note that the count for
2014 is incomplete as it includes papers indexed by PubMed only up to and including May 
2014. 


Although the last few decades have seen significant progress in the study of ASD, the
still relatively poorly understood aetiology of the condition, its phenotypical
heterogeneity [3], and the stigma associated with mental conditions [8] have all contributed
to the penetration of beliefs, and of behavioural and educational interventions, which are
often questionable [5] and poorly supported by evidence (e.g. gluten-free and casein-free
diets, and cognitive behavioural therapy [10]), and sometimes outright in conflict with
science [11]. For example, a recent review of early intensive behavioural and developmental
interventions for young children with ASD rated 1 existing study as being of good quality,
10 as fair quality, and 23 as poor quality [9]. From the public policy point of view,
understanding the practices and beliefs of parents and carers of ASD-affected individuals is
crucial, yet often lacking [12].

Thus the key idea motivating the present work is that the rapid rise in the adoption of 
social media as a platform for the expression and exchange of ideas, which facilitates the 
emergence of special interest communities, can be used to study and monitor the beliefs 
and practices of the population affected by ASD. Considering the challenge of reaching and 
engaging with this specific target population, our findings pave the way for further work of 
potential significant benefit to public health. These benefits include the enrichment of the 
corpus of knowledge of the condition itself by the medical community, and the increased 
understanding of the practices and concerns of those affected by ASD. 

3 Previous work 

While this is the first work exploring the possibility of using Twitter data for the extraction 
of ASD-related information, the broad idea of data-mining Twitter is not new and has been 
employed successfully in a variety of applications. At the same time it is important to stress 
at the very outset that most of the previous work on this topic has not been automatic, that is, 
analysis was performed ‘manually’ by humans. This is a laborious process which severely 
limits how much data can be processed. Additionally, the use of human intelligence rather 
than computer-based methods means that in the case of many reported results, it is not clear 
that the same results could be obtained automatically due to a possible semantic gap. 

A popular research direction focuses on various forms of prediction based on tweet
sentiments [13] inferred from the tweet’s emoticons [14] or using linguistics-based
classifiers [15]. For example, Asur and Huberman [16] showed that tweet posting rate can be
used to forecast film box office revenues, while Baucom et al. [17] used sentiment to analyse
the relationship between the content of tweets and the outcomes of NBA Playoff games. Bollen
et al. [18] showed that tweet sentiments, seen in aggregate as a reflection of public mood
levels, can be used to predict the incidence of socio-political, cultural, and economic events.
Mitchell et al. [19] used geo-tagged tweets across the United States to estimate localized
changes in a variety of sociometric indices such as happiness, the level of education, and
obesity rates.

The quasi-realtime nature of Twitter also makes it a potentially valuable resource for the 
detection and management of emergency situations [20]. For example, Robinson et al. [21]
described an earthquake detection system, while Sakaki et al. [22] used a spatio-temporal
model of tweet frequencies to infer the location of the epicentre of an earthquake. 

Closer in spirit to the work in the present paper is the corpus of work on the
use of Twitter in the domain of health care. For example, Paul and Dredze [23] used tweets
to extract words related to symptoms and treatments, and a topic model to associate them
with the corresponding ailments. In their subsequent work [24], the model was extended to
track the spread of illnesses over time, measure behavioural risk factors, and analyse
symptoms and medication usage. The use of LDA-based health topic modelling was explored by
Prier et al. [25]. Dementia-related tweets were the focus of work by Robillard et al. [26]
who collected relevant tweets over a 24-hour period using keyword filtering and used them to
discover the dominant chatter themes. Similarly, Scanfeld et al. [27] used Twitter to analyse
the patterns of antibiotic use. Jashinsky et al. [28] studied the relationship between
Twitter conversations deemed to reflect high suicide risk and actual suicide rates in the
United States. They demonstrated that high risk individuals may be recognized from their
social media status. Lastly, Himelboim and Han [29] examined the connectivity patterns of
Twitter users interacting within a specific online community of cancer-affected individuals.

In contrast to the sporadic studies on different diseases described above, the use of
Twitter data in the management of highly contagious diseases like influenza has attracted a
more concentrated research effort. For example, Culotta [30] investigated whether the
frequency of influenza epidemic-related tweets can be related to ‘ground truth’ data from the
Centers for Disease Control and Prevention. Achrekar et al. [31] showed that the emergence
and the spread of epidemic influenza can be predicted and tracked from the location and
demographic information of users posting relevant tweets. A similar approach was also
described by Li and Cardie [32]. Yet further evidence of the power of Twitter data was
presented by Chew and Eysenbach [33] who demonstrated that the spatio-temporal distribution
of relevant tweet frequencies during the 2009 H1N1 outbreak closely matches the disease
spreading pattern.

Although the use of Twitter for data-mining information related to ASD has not been 
explored yet, there has been some preliminary work on the use of other social media and 
ASD. For example, Newton et al. [34] used Linguistic Inquiry and Word Count (LIWC)
dictionaries to compare writing patterns of individuals with ASD and those of neuro-typical 
bloggers. 


4 Methods, results, and discussion 

Having outlined the motivation and our ultimate vision for this work, and placed it in the
context of previous research on data-mining social media, we now turn our attention to the main
contribution of the present paper. We start by describing the data set we used for our analysis. 


4.1 Data acquisition 

Twitter’s Terms of Service explicitly prohibit the sharing or redistribution of tweets,
including for research purposes. Consequently, there was no public data set that we could use as a
standard benchmark in this study. Instead, we collected a large data set ourselves. 

The Twitter API offers different means of retrieving tweets. In particular, we used its
‘search’ and ‘streaming’ functions. The former allows the retrieval of historical tweets based
on the presence of specific keywords and meta-data constraints (e.g. on the language or user
location). After being posted, a tweet can be obtained in this manner for up to a week. The
streaming API allows a quasi-realtime retrieval of tweets as they are posted, retrieving a
sample of approximately 1% of all tweets. The search API was most valuable for us for
collecting ASD-related content, as we will describe in detail shortly. Conversely, the
streaming API allowed us to obtain ‘control’ data, unrelated to ASD, since this set could not
be well characterized a priori using a compact set of keywords.

We collected only tweets posted in English. To facilitate a comparative analysis, we 
collected two non-overlapping data sets. The first of these, which we will refer to as the 
ASD subset, comprises tweets which concern ASD. Specifically, we defined the ASD subset 
as comprising tweets which contain any of the four keywords “autism”, “adhd”, “asperger”, 
and “aspie” (or any of their derivatives obtained by suffixation), and the control subset as 
comprising all other tweets. In total this resulted in a corpus of 5,650,989 ASD-related 
tweets collected in the period starting on 26 August 2013 and ending on 1 Oct 2014 (i.e. 
more than 13 consecutive months). Of these, 3,493,742 were original tweets which were 
produced by 1,771,274 unique users, with the remaining 28% of the messages being retweets.
Approximately 25% of the collected tweets (2,260,284) contain so-called hashtags
(see Section 4.4.2 for discussion and analysis). Although this information was not used in
the present work we also report that in our corpus 70,925 tweets are geo-tagged (it should
be noted that geo-tags, being controlled entirely by the user, are not necessarily correct 
location identifiers), 2,599,395 contain URLs, 464,190 were sent in reply, and 3,330,096 
mention another user. 
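
As an illustration only, the keyword criterion described above can be sketched in Python as follows. This is a minimal sketch and not the code used in the study: the names `KEYWORDS`, `is_asd_related`, and `split_corpus` are hypothetical, and treating any word that begins with a keyword as a suffixed derivative is an assumed reading of the criterion. The partition step is likewise conceptual, since the control tweets were gathered separately through the streaming API.

```python
import re

# The four search keywords named in the text. Matching any word that starts
# with one of them is an ASSUMED reading of "derivatives obtained by suffixation".
KEYWORDS = ("autism", "adhd", "asperger", "aspie")
ASD_PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\w*", re.IGNORECASE)


def is_asd_related(tweet_text):
    """Return True if the tweet contains a keyword or a suffixed form of one."""
    return ASD_PATTERN.search(tweet_text) is not None


def split_corpus(tweets):
    """Partition an iterable of tweet texts into (asd_subset, control_subset)."""
    asd, control = [], []
    for text in tweets:
        (asd if is_asd_related(text) else control).append(text)
    return asd, control
```

A call such as `split_corpus(list_of_tweet_texts)` returns the two non-overlapping lists; by construction no tweet can appear in both.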

It is important to observe that it is not our claim that tweets in the ASD subset were 
necessarily posted by individuals suffering from ASD. While some of the messages in this 
subset do fall into this category, the subset will also include posts by individuals affected 
by ASD in a looser sense, e.g. parents or carers of those who suffer from ASD, or indeed 
medical professionals interested in the condition. 

4.1.1 Pre-processing 

Fig. 2: Number of ASD-related tweets we collected each day over the course of our data
acquisition period; note the weekly periodicity. To make the usual day to day variability
visible in the presence of the sharp peak on 2 April (World Autism Awareness Day) logarithmic
scaling was used for ordinate values. The approximately three week gap in data collection
during the 2013-2014 holiday period was caused by the failure of our computer system.

Much of the work in the present paper concerns the analysis of topics discussed by means
of Twitter. In this context it is beneficial to have different inflections of the same word
normalized and represented by a single term. In linguistics this process is referred to as
lemmatization and we apply it automatically using the freely available Natural Language
Toolkit (NLTK) [35]. In addition, we remove the so-called ‘stop words’ which do not carry
much meaning themselves (e.g. articles and connectives), as well as all punctuation marks
and emoticons. To illustrate the effects of our pre-processing we present a few examples.

Original tweet: 

Looks like we will have more #autism research happening for children in #EarlyIntervention
next year! :-) #VisualSupports #MobileTech

Pre-processed result: 

look like autism research happen child earlyintervention visualsupports mobiletech 

Original tweet: 

Authors who see autism as “tremendously burdening" elicit dire views of autism 
from parents http://j.mp/lFyqXwF “Ethical approval: none” 

Pre-processed result: 

author see autism tremend burden elicit dire view autism parent url ethic approv none

Original tweet:

101 autism: Genetic analysis of individuals with autism finds gene deletions - Using 
powerful genetic sequencing. http://is.gd/UhprQK 

Pre-processed result: 

101 autism genet analysi individu autism find gene delet use power genet sequenc 
url 

Original tweet: 

#Apple #Censorship & Dr. Brian Hooker Interview exposing CDC Cover-up of the 
Vaccine & Autism Link on @rediceradio http://youtu.be/19iivPtg6SPI

Pre-processed result: 

appl censorship dr brian hooker interview expos cdc cover vaccin autism link atus
url 
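
The pre-processing steps illustrated by the examples above can be sketched roughly as follows. This is a simplified sketch under stated assumptions rather than the pipeline actually used: the stop-word list is a tiny hypothetical stand-in, the placeholder tokens `url` and `atus` simply mirror those visible in the examples, and word normalization is left as a pluggable `normalize` function (identity by default) into which an NLTK lemmatizer could be passed.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is far longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "in", "on", "to",
              "we", "will", "have", "is", "are", "as", "from", "who", "more"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def preprocess(tweet, normalize=lambda word: word):
    """Lowercase, replace URLs and mentions, drop punctuation and stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" url ", text)        # every link becomes the token "url"
    text = MENTION_RE.sub(" atus ", text)   # every @mention becomes "atus"
    tokens = TOKEN_RE.findall(text)         # discards punctuation and emoticons
    return [normalize(t) for t in tokens if t not in STOP_WORDS]
```

With the default identity `normalize`, inflections are left untouched, so the output will not match the stemmed forms shown in the examples above.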

4.2 Methods and results 

In this section we describe a series of experiments aimed at discovering the properties of 
the collected data. We start with a generic quantitative linguistic analysis and proceed with 
increasingly domain-specific considerations which examine tweet content and the potential 
differentiation between the ASD-related and the control corpora. 

4.2.1 Power law: Zipf’s distribution 

The first experiment we conducted was set to find out if tweets, both in the ASD corpus and
in the control group, obey the so-called Zipf’s law. In its general form, this empirical
law posits that the frequency $P_i$ of an ‘event’ is approximately a power function of its
frequency rank $r_i$, i.e. $P_i \propto r_i^{a}$, where $r_i$ is the rank ($r_i = 1, \ldots$)
of the event when events are ordered by their frequency of occurrence from high to low and
$a < -1$. Zipf’s distribution can be seen to be a type of power law probability distribution.
This statistical regularity is observed in a variety of domains ranging from deadly
quarrels [36] and wealth distribution analysis [37], to information retrieval [38] and
quantitative linguistics [38]. In quantitative linguistics, an event corresponds to the usage
of a particular word or, more precisely in the context of the present paper, a particular
term resulting from the pre-processing of a word as described in Section 4.1.1.
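
To make the rank-frequency relationship concrete, the following minimal sketch computes the relative frequency of each term against its rank and estimates the exponent as the least-squares slope on log-log axes. The function names are hypothetical and the least-squares fit is our illustrative choice; it is not claimed to be the estimation procedure used in this work.

```python
import math
from collections import Counter


def rank_frequency(terms):
    """Return [(rank, relative_frequency)] with rank 1 for the most frequent term."""
    freqs = sorted(Counter(terms).values(), reverse=True)
    total = sum(freqs)
    return [(rank, f / total) for rank, f in enumerate(freqs, start=1)]


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(p) for _, p in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Applied to the list of pre-processed terms of a corpus, `loglog_slope(rank_frequency(terms))` gives a single-number summary of how steeply frequency falls with rank.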

Evidence from previous work suggests that conventional texts, such as newspapers or
books, do result in Zipfian distributions of word frequencies [38]. However, it is not clear
from this that the same applies to tweets. Firstly, tweets are restricted in length to 140
characters which by itself may alter linguistic characteristics of posted messages. In
addition, the nature of Twitter as a communication medium invariably introduces a
self-selecting aspect: neither can the corpus of Twitter users be considered to be a random
sample from the population, nor can the memetic content of tweets be expected to match that
of texts examined by previous work, such as newspapers and books.

In our data set, the ASD corpus of tweets contains 47,048,097 terms (pre-processed 
words) of which 402,946 are unique; the control subgroup used 32,846,321 terms of which 
528,755 are unique. The key findings are summarized in Fig. 3(a) which shows the plot of
tweet term frequency as a function of its overall term frequency rank. It is readily apparent 
that the characteristics of both the ASD and the control group are nearly identical. What 
is more, both can be seen to exhibit approximately linear behaviour on this plot; a mild 
deviation from this behaviour can be observed for the most frequent terms. There are two 
main reasons for this. Firstly, it is generally the case that the simple functional form of Zipf’s 
law fails to be observed for the top-ranking events. Secondly, our use of the logarithmic scale 
for the abscissa means that the left-hand tail of the graph is sparsely populated by data points 
which increases the corresponding error margins in this plot. 

To quantify the linearity of dependence plotted in Fig. 3(a) we used Pearson’s r statistic.
Specifically, we computed the value of the statistic first for the window which spans 20% of
the ranks on the logarithmic scale and is centred at the logarithmic median of the ranks (i.e.
the centre of the plot in Fig. 3(a)), and then proceeded to increase the width, recomputing
the value for the encompassed range until all ranks were included. The variation of Pearson’s
r as a function of the window width is plotted in Fig. 3(b), confirming our previous
observations. The linearity of the central portion of the distribution is nearly perfect as
witnessed by the r value of 0.999. As more data is added a slight non-linearity is observed,
the statistic dropping down to approximately 0.992 for the entire range.
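
The windowed computation described above can be sketched as follows. This is an illustrative sketch only: the names are hypothetical, and taking the window centre to be the midpoint of the logarithmic rank range is our assumed interpretation of "logarithmic median". Note also that for a decreasing rank-frequency relationship Pearson’s r computed this way is negative, so the values quoted in the text presumably refer to its magnitude.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def windowed_r(freqs_desc, width):
    """Pearson's r of log-frequency vs log-rank inside a centred log-rank window.

    freqs_desc: frequencies sorted from high to low (rank 1 first).
    width: fraction (0, 1] of the logarithmic rank range covered by the window.
    """
    log_max = math.log(len(freqs_desc))
    centre = log_max / 2.0            # ASSUMPTION: midpoint of the log-rank range
    half = width * log_max / 2.0
    eps = 1e-12                       # tolerance for the window end points
    xs, ys = [], []
    for rank, freq in enumerate(freqs_desc, start=1):
        x = math.log(rank)
        if centre - half - eps <= x <= centre + half + eps:
            xs.append(x)
            ys.append(math.log(freq))
    return pearson_r(xs, ys)
```

Evaluating `windowed_r` for widths from 0.2 to 1.0 yields a curve of the kind the text describes, with any departure from linearity at the ends of the distribution appearing as the window widens.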

In addition to their novelty, these findings are interesting in terms of directing our research
towards our ultimate goal of automatically retrieving and analysing the content of
ASD-related tweets. In particular, the discovery that tweets, including those in the sub-group 
of our primary interest, obey Zipf’s law suggests that it is sensible to adopt and explore the 
use of a broad range of well-known and well-understood text representations and methods 
of analysis. Indeed we do this next. 


4.3 Message length 

Having established that Twitter messages conform to some of the same general linguistic
rules as conventional texts do, our next goal was to explore any differential characteristics
exhibited by our ASD and control groups. Recall from Section 4.1 that our data set is
balanced in the number of tweets, that is, the number of tweets in the ASD corpus is the same
as in the control corpus. Yet, as pointed out in the previous section, the term counts in the
two data sets are different, respectively 47,048,097 and 32,846,321, a difference of
approximately 43%. It is a direct implication of this observation that the average term count
per tweet is greater in the ASD group, which is the first indication of there being a
difference between tweets in this group and the remainder of our data corpus. The tendency of
this group to post longer messages is further corroborated by comparing the corresponding
histograms of tweet word counts, which are shown in Fig. 4. While the tweet word count
in both sub-groups can be seen to be log-normally distributed, the two distributions have
significantly different means (p < 0.01).
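
One way such a comparison of means could be carried out is sketched below. The text does not state which test produced the reported p-value, so the choice of Welch’s t statistic on the logarithm of the word counts (natural for log-normal data) is purely our assumption, as are the function names.

```python
import math
from statistics import mean, variance


def log_word_counts(tweets):
    """Natural log of the number of whitespace-separated words in each tweet."""
    return [math.log(len(t.split())) for t in tweets if t.split()]


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error
```

A positive value of `welch_t(log_word_counts(asd_tweets), log_word_counts(control_tweets))` indicates longer messages in the first sample; converting the statistic to a p-value additionally requires the t distribution, which is omitted here.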

Fig. 3: (a) A log-log plot showing the dependence of tweet term (word) frequency on its
frequency rank within the corpus of all terms in our data set. Approximately linear dependency
is exhibited by data extracted from both the ASD and the control data sets, in agreement with
the Zipfian power law distribution. (b) Linear dependency is quantified using the Pearson’s
r statistic. To account for the expected artefacts at the ends of the distribution, the statistic is
plotted as a function of the width of the window centred at the logarithmically median rank
and spanning 20-100% of the logarithmic range of possible ranks. Pearson’s r attains values
of over 0.995 for the central portion of the distribution, and drops slightly (to approximately
0.992) for the entire range when the end artefacts are included.

Fig. 4: Normalized histograms of tweet word counts for the two subsets. As expected,
log-normal distributions are observed but, importantly in the context of the present work, with
different parameters (most obviously the means, p < 0.01).


4.4 Content analysis 

Our next aim was to explore if the collected data offers evidence that tweets in the ASD and 
control data sets differ significantly by their content (i.e. topic of conversation) and if so, if 
automatic methods could be employed in the analysis of this content. 


4.4.1 Word frequency 

In the first experiment we approached this task by comparing the most frequently used words 
in the two data sets. These words can be seen as a simple cumulative proxy for the actual 
content of individual messages. The key results are illustrated in Fig. 5(a) and Fig. 5(b)
which show the 100 most frequent words in respectively the ASD and control data sets,
displayed as so-called ‘word-clouds’ whereby the frequency of a particular word is encoded
by the corresponding font size. We used a linear scale, the font size thus being proportional
to the corresponding word’s frequency in a data set.
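
The linear frequency-to-font-size mapping described above amounts to the following small computation. This is an illustrative sketch: the function name and the `max_font` parameter (the size given to the most frequent word) are hypothetical, and rendering the cloud itself is left out.

```python
from collections import Counter


def word_cloud_sizes(terms, top_n=100, max_font=72.0):
    """Map each of the top_n most frequent terms to a font size that is
    directly proportional to its frequency (the most frequent gets max_font)."""
    top = Counter(terms).most_common(top_n)
    if not top:
        return {}
    peak = top[0][1]
    return {word: max_font * count / peak for word, count in top}
```

Because the scale is linear, a word occurring half as often as the most frequent one is drawn at half its font size.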


(a) ASD ‘word-cloud’


(b) Control ‘word-cloud’ 

Fig. 5: The most frequent words in the (a) ASD and (b) control data sets, shown as so- 
called ‘word-clouds’. The font size used to display a particular word is proportional to the 
corresponding word’s frequency in the data set. 


As readily observed from Fig. 5, the most commonly used words in the two groups of
tweets reveal a substantial difference in the nature of discussed topics. A more thorough
examination reveals even further meaningful patterns coherent with the existing literature
on ASD. In particular, observe the presence of a large number of words in the plot
corresponding to the ASD group which are related to children such as “children”, “kids”,
“child”, and “son”, for example. There are a number of reasons why this is unsurprising.
Firstly, the diagnosis of ASD is usually made in early childhood so it is reasonable to expect
that the parents confronting this new challenge would show increased initiative in seeking
help from the community of parents in a similar situation. In addition, while undoubtedly
always vulnerable, individuals on the autism spectrum are most vulnerable while they are
young, which is when they need the most support from their guardians and therapists, e.g. in
acquiring the skills needed to progress through the educational system and integrate into
society.

Adham Beykikhoshk, Ognjen Arandjelovic, Dinh Phung, Svetha Venkatesh, and Terry Caelli

It is also interesting to observe the high frequency of the words “son” and “boy” in the 
ASD data (top-right section of the word cloud), and the absence of the equivalent female sex 
words “daughter” and “girl”. This example illustrates well how the content of tweets can be 
used to extract some rather subtle information. In particular in this case our findings are 
consistent with the understanding of the medical community and a body of evidence which 
shows that boys are nearly five times more likely to suffer from autism than girls [39].

In contrast, the most frequently used words in the control group do not seem to follow
any particular pattern or focus of interest, and instead pertain to more general everyday
interests and activities. Lastly, it is interesting to observe that the set of words in
Fig. 5(a) appears to contain more nouns than that in Fig. 5(b), and fewer verbs. This suggests
a different nature of information exchange in the ASD group which appears more focused on
issues (syndromes, disorders, interventions, support, help, and so on) and individuals (mostly
children), both being described by nouns, while the users who posted tweets of the control
group seem to be more interested in what it is that they are or will be doing (like, want, go,
think, and so on). We will explore this quantitatively in more detail in Section 4.4.3.

4.4.2 Hashtag analysis 

An alternative proxy for tweet content can be found in user-designated tags, the so-called
‘hashtags’. These can be recognized by the leading special character ‘#’ (the hash sign)
and are by convention understood to be meta-data labels in some manner connected with
the content of the tweet which contains them. Recall from Section 4.1 that 2,260,284 or
approximately 25% of the collected tweets contain hashtags.
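
A minimal sketch of how hashtags might be extracted and counted is given below. It is illustrative rather than the procedure actually used: the names are hypothetical, the regular expression is a simplification of Twitter’s own hashtag rules, and dropping only hashtags that exactly equal a search keyword is an assumption about how the search-keyword hashtags were excluded from the ASD word-cloud.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
SEARCH_KEYWORDS = {"autism", "adhd", "asperger", "aspie"}


def top_hashtags(tweets, n=100, drop_keywords=True):
    """Return the n most frequent lowercase hashtags as (tag, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        for tag in HASHTAG_RE.findall(tweet.lower()):
            # ASSUMPTION: only exact matches with a search keyword are dropped,
            # so compound tags such as "autismspeaks" are retained.
            if drop_keywords and tag in SEARCH_KEYWORDS:
                continue
            counts[tag] += 1
    return counts.most_common(n)
```

The resulting counts can be fed to the same linear font-size mapping as ordinary terms to produce a hashtag word-cloud.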

(b) Control hashtags ‘word-cloud’

Fig. 6: The most frequent hashtags in the (a) ASD and (b) control data sets, shown as
so-called ‘word-clouds’. The font size used to display a particular word is proportional to the
corresponding word’s frequency in the data set. Note that the hashtags corresponding to the
search keywords have been removed before producing the ASD word-cloud.

We summarize our findings as word-clouds of the 100 most frequent hashtags in the
ASD and control data sets in Fig. 6(a) and Fig. 6(b). As in the previous analysis of the most
frequently used terms, the difference between the two data sets is readily apparent.
Unsurprisingly, most of the hashtags in the ASD data set pertain to health issues, some of
which are ipso facto ASD-related, such as “mentalhealth” and “psychology”, while others are
less obviously so. Examples of the latter include “fibromyalgia” and “vaccines”. Fibromyalgia
is a class of disorders related to the body’s processing of pain which recent evidence
suggests may have a potential connection with ASD [40]. Similarly, although now discredited,
previous research had suggested a causal link between children being vaccinated and
developing autism [41]. These examples are further evidence of the type of powerful
information which can be harvested from Twitter. In particular, they show that it is possible
to data-mine tweets to provide feedback to medical practitioners on the concerns of the ASD
community, the penetration (or lack thereof) of relevant public health recommendations (e.g.
applied behaviour analysis, ABA), or the adoption of treatments of questionable efficacy (e.g.
homeopathy, gluten-free diet).

Much like the word-cloud in Fig. 5(a), the hashtag word-cloud in Fig. 6(a) contains many
references to various topics pertaining to the support of individuals on the autism spectrum.
However, the nature of the two sets of terms is somewhat different. While the most
frequently used terms mostly referred to the issues themselves, the corresponding hashtags
mostly refer to particular support groups. For example, “autismspeaks” (centre-left position
within the word-cloud) is a US-based autism advocacy organisation, “thisisautism”
(centre-right position within the word-cloud) is a Twitter stream used to share autism-related
experiences, and “AS2DC” (top-right corner of the word-cloud) is an action summit on autism
held in Washington, D.C.

4.4.3 Part-of-speech analysis 

In Section 4.4.1, in the discussion of our findings in the comparative analysis of the most
frequently used words in the ASD and control tweets, we observed that the ASD tweets
appeared to contain a greater proportion of nouns and a smaller proportion of verbs than
the control tweets. This observation led us to investigate this matter in further detail. In
particular, we compared the usage of different part-of-speech (POS) types in the two data
sets.

We used the free TweetNLP software package [42], trained on Twitter data, to tag
automatically all terms in our entire data set according to their POS type. A selected set of the
most interesting results is shown in the plots in Fig. 7. The first thing to observe from the
entire corpus of results is that there is a consistent difference between the ASD and control
tweets. Notwithstanding the large standard deviation values and the significant overlap of
the distributions corresponding to ASD and control tweets, the fact that each plot exhibits
the same relative behaviour regardless of the tweet length shows that the difference between
the two corpora is real rather than a result of stochasticity. We
confirmed this statistical significance rigorously too. In order to collate all data shown in a
single plot, rather than considering the actual POS type count per tweet we considered the POS
type count normalized by tweet length. This allowed us to perform a single Student’s t-test
per POS type (i.e. per plot in Fig. 7). In all cases we obtained p < 0.05.
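
The normalization and test described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the tag names and toy tweets are invented, the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, variance


def normalised_pos_count(tags, pos):
    """Fraction of a tweet's tokens that carry the given POS tag."""
    return sum(1 for t in tags if t == pos) / len(tags) if tags else 0.0


def pooled_t_test(a, b):
    """Student's two-sample t statistic with pooled variance.

    The two-sided p-value is a normal approximation (an assumption of this
    sketch), adequate only when both samples are large.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy POS-tagged tweets; the tag names are illustrative, not TweetNLP's tag set.
asd = [["N", "N", "V", "N"], ["N", "A", "N"], ["N", "V", "N", "N", "P"]]
control = [["V", "P", "V", "N"], ["P", "V", "A"], ["V", "V", "N", "P", "R"]]

asd_nouns = [normalised_pos_count(t, "N") for t in asd]
control_nouns = [normalised_pos_count(t, "N") for t in control]
t_stat, p_value = pooled_t_test(asd_nouns, control_nouns)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Dividing by tweet length is what allows tweets of all lengths to be pooled into a single sample per POS type and per sub-group.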

Next we turned our attention to the interpretation of the specific results in Fig. 7. Note
by observing the plots in Fig. 7(a) and Fig. 7(b) that our hypothesis from Section 4.4.1
regarding the type of information communicated within the ASD and control sub-groups
is confirmed. Recall that by examining qualitatively word clouds of the dominant terms in
the two corpora we hypothesised that the tweets in the ASD sub-group are more focused
on issues (syndromes, disorders, interventions, support, help, and so on) and individuals
(mostly children), described by nouns, while the control tweets more often relate to what it
is that their authors are doing (liking, wanting, going, thinking, etc.). Indeed, for the average
tweet length (as counted by the number of words in a message) of 15-20 terms, the number
of nouns in the ASD data set is about twice the number of nouns in the control data set.
The difference in the number of verbs, while not as substantial, is also significant - for the
average length message, the control data set contains approximately 20% more verbs.
Interestingly, the number of proper nouns does not appear to differ significantly between the
two groups. This is perhaps somewhat surprising, as it could be expected that the control
group would engage more in the discussion of so-called celebrities. A more thorough,
manual examination of relevant tweets reveals the reason - while the ASD group does engage
in less chatter about celebrities, this is offset by the substantial attention that autistic
people, specific autism activists, or politicians commenting or acting on autism-related issues
receive.


4.5 Tweet classification 

In Section 4.2.1 we provided evidence that tweets, including those made by the
community interested in ASD, obey some of the same general regularities when it comes to
the pattern of word usage. Then, starting in Section 4.3 and corroborating this further in
Sections 4.4.1, 4.4.2, and 4.4.3, we showed that in terms of their semantics and content,
messages of the ASD community appear to exhibit a range of characteristics which
differentiate them from those in our control data set. This motivated us to explore whether it
is possible to classify tweets automatically as belonging to the ASD data set or not.

It is important to emphasise right at the beginning that in this experiment we
eliminated from all tweets the four search keywords which we used to divide our entire tweet
data corpus into ASD and control sets, i.e. which we used to define the quasi-ground truth
labelling. Recall from Section 4.1 that these keywords are “autism”, “adhd”, “asperger”, and
“aspie”. Had they not been removed, classification performance would have been artificially
increased, as their presence in a tweet would have been learnt as perfectly
predicting the classification output. Furthermore, it is reasonable to expect that the ‘true’ ASD
corpus of tweets should include some tweets which do not contain any of the
aforementioned four keywords. In that sense, we hypothesised that it would be possible to use simple
tweet filtering based on a small number of obvious keywords to build a bootstrap training
data set which could in turn be used to learn additional aspects characterizing ASD-related
tweets, thereby facilitating a more robust retrieval of messages of interest.
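
A minimal sketch of the quasi-ground truth labelling and keyword removal described above is given below. The regular expression, the function name `label_and_strip`, and the choice to match a keyword followed by any word characters (to approximate suffixation) are assumptions made for illustration only.

```python
import re

KEYWORDS = ("autism", "adhd", "asperger", "aspie")
# Each keyword plus any suffix (e.g. "aspergers", "aspies"), matched at a word start.
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\w*", re.IGNORECASE)


def label_and_strip(tweet):
    """Return (label, text).

    label is +1 (ASD) if any keyword matches and -1 (control) otherwise;
    text is the tweet with every keyword match removed, so that a classifier
    cannot simply learn the labelling rule itself.
    """
    label = 1 if KEYWORD_RE.search(tweet) else -1
    stripped = KEYWORD_RE.sub(" ", tweet)
    return label, " ".join(stripped.split())


print(label_and_strip("it is the adhd, oops!"))
print(label_and_strip("Watching football tonight"))
```

The first example shows how little content can remain in a short tweet once its keyword is removed.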

[Figure 7: part-of-speech comparison plots. Panels: (a) Nouns, (b) Verbs, (c) Proper nouns,
(d) Pronouns, (e) Adjectives, (f) Adverbs, (g) Interjections. Horizontal axis: tweet word count.]

Fig. 7: A summary of the key results of comparative part-of-speech analysis. In all cases
a statistically significant (p < 0.05) difference between ASD and control sub-groups is
observed. The most interesting insight is that tweets in the ASD sub-group are more focused
on issues (syndromes, disorders, interventions, support, help, and so on) and individuals
(mostly children), described by nouns, while the control tweets more often relate to what it
is that their authors are doing (liking, wanting, going, thinking, etc.).

Discriminative power of individual terms To examine the discriminative potential of tweet
content following the removal of our keywords, we first looked at the discriminative power
of individual words. This is effectively visualized in Fig. 8. What the plot shows is the most
frequent 1300 words in our entire data set, which cover respectively 75% and 83% of the
words in ASD and control subsets, depicted as circular blobs and colour-coded by their
discriminative power (specifically, the saturation of a blob’s colour is proportional to the
absolute value of the logarithm of the ASD-control likelihood ratio of the corresponding
word). As expected based on our previous results presented in Sections 4.4.1 and 4.4.2, a
number of words pertaining to issues of most concern to the ASD community are highly
discriminative. This observation motivated our next set of experiments which examined tweet
classification in detail.


[Figure 8: words plotted as circular blobs; horizontal axis: term frequency (ASD group),
logarithmic scale. Legible word labels include “you”, “rtuser”, “love”, “brain”, “training”,
and “education”.]


Fig. 8: A compact illustration of the discriminative power of different individual words. 
Each circular blob represents a word, the saturation of its colour being proportional to the 
absolute value of the logarithm of the ASD-control likelihood ratio of the word. 








Using Twitter to Learn about the Autism Community 




4.5.1 Automatic classification 

Representations We evaluated a number of different representations of tweet content. Here 
we present three which overall produced the most interesting results. These are variations 
on well-known representations in the existing literature, adapted to the problem at hand. 
Specifically, we report on the performance of the following: 

- binary bag of terms (pre-processed words) ED ESI 

- integer bag of terms (also referred to as term count) ED, and 

- tf-idf (term frequency-inverse document frequency) score [46].


In the binary bag of terms representation used in this work, each entry $x_i$ in a feature
vector $\mathbf{x}_{bow} = [x_1, x_2, \ldots, x_{n_d}]^T$ corresponds to a particular term and is
coded as either present in a particular tweet (value 1) or absent from it (value 0), where
$\mathbf{x}_{bow} \in \mathbb{R}^{n_d}$ and $n_d$
is the dictionary size over which the feature vector is constructed (we will discuss vocabulary
construction in more detail shortly). Thus, the original tweet:

Looks like we will have more #autism research happening for children in #EarlyIntervention
next year! :-) #VisualSupports #MobileTech

which following our pre-processing ends up as:

look like autism research happen child in earlyintervention visualsupports mobiletech

is represented using the following binary bag of words feature vector:

$$\mathbf{x}_{bow} = [\,\ldots\; \overbrace{1}^{\text{look}}\; 0 \;\ldots\; \overbrace{1}^{\text{like}} \;\ldots\; 0 \;\ldots\; \overbrace{1}^{\text{research}}\; 0 \;\ldots\; 0 \;\ldots\; \overbrace{1}^{\text{happen}}\; 0 \;\ldots\; \overbrace{1}^{\text{child}}\; 0 \;\ldots\,]^T$$

Note that the term “autism” is not included as it is matched by one of our search keywords
used for quasi-ground truth labelling, as described in Section 4.1. On the other hand, the
terms “earlyintervention”, “visualsupports”, and “mobiletech” do not have
corresponding entries in the feature vector because their frequency across the data corpus is too low,
i.e. they are not amongst the most frequent terms included in our vocabulary. Lastly, “in” is
excluded (automatically, as are all other such terms) on the basis that it is too short.

The term count representation is similar to the binary bag of words described previously.
As before each entry in a feature vector corresponds to a particular term. However, unlike
before the entries are not binary variables which signify the presence or the absence of a
particular term but rather actual counts of the term’s instances in a tweet. Since the previously
given example of a tweet does not include any repeated terms, the term count feature vector
$\mathbf{x}_{tc} \in \mathbb{R}^{n_d}$ in this instance is identical to $\mathbf{x}_{bow}$. On the
other hand, the tweet mentioned in Section 4.1.1 which after pre-processing looks as follows:

101 autism genet analyst individu autism find gene delet use power genet sequenc url

produces different bag of words and term count feature vectors, the former having the value
of 1 for the entry corresponding to the term “genet” and the latter the value of 2 since the
term has two occurrences in the tweet.
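
The difference between the binary and term count representations can be made concrete with a short sketch. The six-term vocabulary below is a toy assumption (the experiments use a far larger one), and the function names are hypothetical.

```python
from collections import Counter


def term_count_vector(terms, vocabulary):
    """Number of occurrences in the tweet of each vocabulary term."""
    counts = Counter(terms)
    return [counts[v] for v in vocabulary]


def binary_vector(terms, vocabulary):
    """1 if a vocabulary term occurs in the tweet at least once, else 0."""
    return [1 if c > 0 else 0 for c in term_count_vector(terms, vocabulary)]


# Toy vocabulary; terms outside it (including removed keywords) get no entry.
vocabulary = ["child", "genet", "happen", "like", "look", "research"]
tweet = ("101 autism genet analyst individu autism find gene delet use "
         "power genet sequenc url").split()

print(binary_vector(tweet, vocabulary))
print(term_count_vector(tweet, vocabulary))
```

Only the entry for “genet” differs between the two vectors, taking the value 1 in the binary case and 2 in the term count case.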

Lastly, the tf-idf representation $\mathbf{x}_{tfidf} \in \mathbb{R}^{n_d}$ too has entries which
correspond to different vocabulary terms, but with values which measure the importance of a
particular term in a tweet. Specifically, a particular entry is equal to the frequency of the
corresponding term (‘term frequency’) weighted by the inverse number of tweets in the training
corpus which also contain the term (‘document frequency’). This representation has been used
successfully in a variety of applications, from text mining [46] to visual object recognition and
retrieval m.
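
As an illustration of the weighting just described, the sketch below computes tf-idf entries for a toy corpus. It uses the common logarithmic form of the inverse document frequency, log(N / df); this particular form is an assumption, as the exact variant used in the experiments is not specified here.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Number of tweets in the corpus that contain each term at least once."""
    df = Counter()
    for terms in corpus:
        df.update(set(terms))
    return df


def tfidf_vector(terms, vocabulary, df, n_docs):
    """Term frequency weighted by log(n_docs / document frequency)."""
    tf = Counter(terms)
    return [tf[v] * math.log(n_docs / df[v]) if df[v] else 0.0
            for v in vocabulary]


corpus = [["genet", "gene", "genet"], ["gene", "child"], ["child", "help"]]
df = document_frequencies(corpus)
vocabulary = ["child", "gene", "genet"]
vec = tfidf_vector(corpus[0], vocabulary, df, len(corpus))
print(vec)
```

A term that is repeated within a tweet but rare across the corpus, such as “genet” here, receives the largest weight.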

Vocabulary construction We also explored a number of different ways of selecting the
vocabulary over which tweet feature vectors are constructed based on the discriminativeness of
terms, as well as their group-based or combined frequencies of use. Specifically, we
examined vocabularies formed by choosing the most frequent terms in one of the two sub-groups,
the most discriminative terms in one of the two sub-groups (as explained in Section 4.5 and
visualised in Fig. 8), the union or the intersection of either of the former (e.g. the union of the
most discriminative terms in the ASD sub-group and the most discriminative words in the
control sub-group), the most frequent words across the entire corpus (i.e. both sub-groups),
and the most discriminative words across the entire corpus. Our results suggest that the best
performance is achieved by selecting a set of the overall most frequent terms; due to the
limitations of space we report the corresponding results only, using the 1500 most frequent
terms.
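
The best-performing strategy, selecting the overall most frequent terms, might be sketched as follows. The minimum term length of three characters and the `excluded` parameter are hypothetical choices made for this example; the text states only that very short terms such as “in” and the search keywords are left out.

```python
from collections import Counter


def build_vocabulary(corpus, size=1500, min_length=3, excluded=()):
    """Return the `size` most frequent terms over all tweets in the corpus.

    Terms shorter than `min_length` (a hypothetical threshold) and terms in
    `excluded` (e.g. the search keywords) are ignored.
    """
    counts = Counter(
        w for terms in corpus for w in terms
        if len(w) >= min_length and w not in excluded
    )
    return [w for w, _ in counts.most_common(size)]


corpus = [["look", "like", "in", "child"], ["child", "help", "in"],
          ["child", "help", "autism"]]
vocab = build_vocabulary(corpus, size=2, excluded=("autism",))
print(vocab)
```

The other strategies listed above differ only in the scoring used to rank terms (for example a discriminativeness score instead of a raw count) and in which sub-group's tweets are counted.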


Classification methodology In all experiments reported in this section we adopt the
supervised classification paradigm. Specifically, we assume that we have available a training set
of pairs $\{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n_t)}, y^{(n_t)})\}$
where $\mathbf{x}^{(i)}$ is the feature vector (representation) corresponding to the i-th training
tweet and $y^{(i)}$ its binary label signifying if the
tweet belongs to the ASD or the control sub-group. After a classifier is trained, the label of a
novel query tweet described by the feature vector $\mathbf{x}$ is predicted as follows:

$$\hat{y} = \arg\max_y \frac{\Pr(y)\Pr(\mathbf{x}|y)}{\Pr(\mathbf{x})} = \arg\max_y \Pr(y)\Pr(\mathbf{x}|y). \qquad (1)$$

In our experiments we used half of the collected data corpus for training, and the remaining 
half for testing the performance of different representations and classifiers. Three popular 
classification methods were examined: 

- naive Bayes-based [48],

- logistic regression-based [49], and

- least absolute shrinkage and selection operator (LASSO)-based.


In naive Bayes classification the strong assumption of independence between different
terms in a feature vector is made, resulting in the following class likelihood estimate:

$$P_{NB}(y = \pm 1 | \mathbf{x}) \propto \prod_{i=1}^{n_d} \Pr(x_i | y = \pm 1), \qquad (2)$$

where, without loss of generality, the values +1 and -1 of the dependent variable y are used
to signify respectively ASD and control sub-group memberships. Conditional probabilities
$\Pr(x_i | y = \pm 1)$ are estimated in a straightforward manner from the frequencies of the
corresponding terms in the training data corpus.
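
A minimal sketch of naive Bayes classification over binary feature vectors is given below. It combines the per-term product of (2) with the class prior of (1) and works in the log domain for numerical stability. The Laplace smoothing parameter `alpha` and the toy data are assumptions of this example, and both classes must be present in the training data.

```python
import math


def train_naive_bayes(X, y, alpha=1.0):
    """Estimate log Pr(y) and Pr(x_i = 1 | y) for y in {+1, -1}.

    X holds binary feature vectors; `alpha` is Laplace smoothing
    (an assumption of this sketch).
    """
    model = {}
    for label in (+1, -1):
        rows = [x for x, yi in zip(X, y) if yi == label]
        n = len(rows)
        # Smoothed frequency with which each term is present in this class.
        p = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[label] = (math.log(n / len(X)), p)
    return model


def predict(model, x):
    """Return the label maximising log Pr(y) + sum_i log Pr(x_i | y)."""
    def log_posterior(label):
        log_prior, p = model[label]
        return log_prior + sum(
            math.log(pi if xi else 1.0 - pi) for xi, pi in zip(x, p)
        )
    return max((+1, -1), key=log_posterior)


X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [+1, +1, -1, -1]
model = train_naive_bayes(X, y)
print(predict(model, [1, 0, 0]), predict(model, [0, 0, 1]))
```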

In logistic regression, the conditional probability of the dependent variable y is modelled
as a logit-transformed multiple linear regression of the explanatory variables
$x_1, \ldots, x_{n_d}$:

$$P_{LR}(y = \pm 1 | \mathbf{x}, \mathbf{w}) = \frac{1}{1 + e^{-y \mathbf{w}^T \mathbf{x}}}. \qquad (3)$$

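The model is trained (i.e. the weight parameter $\mathbf{w}$ learnt) by maximizing the likelihood of
the model on the training data set, given by:

$$\prod_{i=1}^{n_t} \Pr(y^{(i)} | \mathbf{x}^{(i)}, \mathbf{w}) = \prod_{i=1}^{n_t} \frac{1}{1 + e^{-y^{(i)} \mathbf{w}^T \mathbf{x}^{(i)}}}, \qquad (4)$$

penalized by the complexity of the model:

$$\exp\left(-\frac{1}{C} \mathbf{w}^T \mathbf{w}\right), \qquad (5)$$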


which can be restated as the minimisation of the following regularized negative log-likelihood:

$$\mathcal{L} = C \sum_{i=1}^{n_t} \log\left(1 + e^{-y^{(i)} \mathbf{w}^T \mathbf{x}^{(i)}}\right) + \mathbf{w}^T \mathbf{w}. \qquad (6)$$

A coordinate descent approach described by Yu et al. [49] was used to minimize $\mathcal{L}$.
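
To make the objective in (6) concrete, the sketch below evaluates it and minimises it on toy data. Note that it uses plain batch gradient descent rather than the coordinate descent method of Yu et al.; the step size, iteration count, and data are assumptions chosen only so that the example runs.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def objective(w, X, y, C=1.0):
    """C * sum_i log(1 + exp(-y_i w.x_i)) + w.w, as in (6)."""
    loss = sum(math.log1p(math.exp(-yi * dot(w, xi))) for xi, yi in zip(X, y))
    return C * loss + dot(w, w)


def gradient(w, X, y, C=1.0):
    g = [2.0 * wj for wj in w]                       # gradient of w.w
    for xi, yi in zip(X, y):
        s = 1.0 / (1.0 + math.exp(yi * dot(w, xi)))  # sigmoid(-y w.x)
        for j, xij in enumerate(xi):
            g[j] -= C * yi * s * xij
    return g


def fit(X, y, C=1.0, step=0.1, iters=200):
    """Batch gradient descent (NOT the coordinate descent of Yu et al.)."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(w, X, y, C)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w


X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [+1, +1, -1, -1]
w = fit(X, y)
print(objective([0.0] * 3, X, y), objective(w, X, y))
```

A new tweet with feature vector x would then be assigned the label given by the sign of w.x.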

Finally, in LASSO-based classification, sparseness of the solution is achieved by
replacing the $L_2$-norm penalty in (6) with the $L_1$-norm, yielding the following regularized
negative log-likelihood:

$$\mathcal{L} = C \sum_{i=1}^{n_t} \log\left(1 + e^{-y^{(i)} \mathbf{w}^T \mathbf{x}^{(i)}}\right) + \lambda \|\mathbf{w}\|_1, \qquad (7)$$

where $\lambda$ is a free parameter governing the tradeoff between prediction loss and solution
sparsity.
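
The way in which the L1 penalty produces sparse solutions can be illustrated with a proximal gradient (soft-thresholding) sketch of the objective in (7). The solver, step size, and toy data are assumptions of this example and are not claimed to match the optimiser used in the experiments.

```python
import math


def soft_threshold(v, t):
    """Shrink v towards zero by t; values within [-t, t] become zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_l1_logistic(X, y, lam, C=1.0, step=0.1, iters=500):
    """Proximal gradient descent on C * logistic loss + lam * ||w||_1."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = [0.0] * len(w)
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xij for wj, xij in zip(w, xi))
            s = 1.0 / (1.0 + math.exp(margin))       # sigmoid(-y w.x)
            for j, xij in enumerate(xi):
                g[j] -= C * yi * s * xij
        # Gradient step on the smooth loss, then the L1 proximal step.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, g)]
    return w


def count_nonzero(w):
    return sum(1 for wj in w if wj != 0.0)


X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [+1, +1, -1, -1]
for lam in (0.1, 2.0):
    w = fit_l1_logistic(X, y, lam)
    print(lam, count_nonzero(w))
```

In this toy problem the uninformative middle feature is dropped for the smaller penalty, and a sufficiently large penalty removes every feature.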


Basic classification: results and discussion We first evaluated the naive Bayes and logistic
regression-based classifiers as these do not have free parameters, i.e. parameters which must
be set a priori. The corresponding classification results are summarized in Table 1. With
the exception of the result achieved with logistic regression and the tf-idf representation,
different combinations of base classifiers and representations performed nearly identically.
Specifically, it can be seen that approximately 79% of tweets were correctly classified. This
is a rather remarkable performance considering the brevity of tweets and the fact that some
of the most informative (in the context of the classification task at hand) words were not
used for classification. It is insightful to notice the consistently higher accuracy attained for
control (84-85%) rather than ASD tweets (71-74%). We explored this finding in more depth
by manually inspecting misclassified messages. Within the ASD data corpus, we found that
the main source of classification error lay in the absence of any ASD-specific information
in very short tweets, after the removal of our four keywords used to construct the data set
in the first place (e.g. “it is the adhd, oops!”). This is a highly comforting finding since of
course in any practical application these keywords would not be eliminated, thus improving
classification performance dramatically.


Table 1: A summary of the classification results (ASD-related vs. non-ASD-related tweets).
Each cell (corresponding to a combination of a classification method and a representation)
shows the associated confusion matrix. The first row/column of a confusion matrix
corresponds to the control class and the second row/column to the ASD class (thus, for example,
using the term count representation, naive Bayes correctly classified 84% of control tweets
and 71% of ASD tweets).



Confusion matrix       Representation
                       Binary        Term count    tf-idf

Naive Bayes            0.84 0.16     0.84 0.16     0.84 0.16
                       0.29 0.71     0.29 0.71     0.38 0.62

Logistic regression    0.85 0.15     0.85 0.15     0.03 0.97
                       0.26 0.74     0.27 0.73     0.04 0.96
We also experimented with the application of latent semantic analysis (LSA) [50],
testing out the possibility that better discrimination may be achieved by working in the so-called
concept rather than term space. Although LSA has been shown to be highly successful in
a variety of information retrieval tasks, our experiments suggest that the same benefit is
not observed when it is applied on tweets. The likely reason for this can be found in the
brevity of tweets, which favours the use of succinct expressions whereby individual terms
(unigrams) themselves effectively become LSA concepts. This also explains why the simple
naive Bayes classifier performed on par with the generally superior logistic regression-based
one, as can be seen in Table 1.

We next evaluated the LASSO-based classifier. In order to examine fully the behaviour
of this approach we did not perform the usual procedure for selecting the value of the free
parameter $\lambda$ using cross-validation; rather we measured classification performance across a
range of $\lambda$ values. Specifically we considered values from $\lambda = 10^{-5}$ up to
$\lambda = 10^{-2}$; values greater than $10^{-2}$ produced numerical problems caused by an
excessively small set of selected predictor terms. Our results are summarized in Fig. 9. The plot
shows that the classifier exhibits consistent performance across the range of $\lambda$ between
$10^{-5}$ and $10^{-4}$, with the accuracy of the classification of ASD tweets deteriorating
thereafter (with a small increase in the accuracy of control tweet classification). This
deterioration is readily explained by examining the number of terms selected by LASSO, i.e. the
number of terms with non-zero corresponding regression coefficients. For $\lambda = 10^{-5}$ this
number is 801, dropping to 580 for $\lambda = 10^{-4}$, and reducing drastically to 151 for
$\lambda = 10^{-3}$ and to 12 for $\lambda = 10^{-2}$ as a consequence of the harsh penalty on
model complexity effected by the large values of $\lambda$.

[Figure 9: classification performance plotted against $\lambda$; horizontal axis: $\lambda$,
logarithmic scale from $10^{-5}$ to $10^{-2}$.]

Fig. 9: LASSO-based classification performance as a function of the loss-sparseness tradeoff
parameter $\lambda$. The number of non-zero regression coefficients for $\lambda$ equal to
$10^{-5}$, $10^{-4}$, $10^{-3}$, and $10^{-2}$ is respectively 801, 580, 151, and 12. High values
of $\lambda$ which severely penalize model complexity can be seen to result in deteriorating
classification performance for ASD-related tweets.

Lastly, useful insight can be gained by considering the terms which correspond to the
largest magnitude regression coefficients. These are, in order, “help”, “kid”, “aw” (the
automatically constructed stem of words such as “awareness” and “aware”), and “child”. The
corresponding regression coefficients are all positive, i.e. the presence of these terms
predicts ASD-related tweets. Consequently, control tweets are predominantly predicted by the
absence rather than presence of specific terms; the highest magnitude negative coefficient,
that is the coefficient of the term which most strongly predicts a control tweet, belongs to
“follow”. The observation that control tweets are predicted by the absence of ASD-related terms
explains why a reduction in model complexity (i.e. a reduction in the number of ASD-terms
considered by the classifier) does not produce a deterioration in classification performance
on control tweets (rather, a small improvement can be noticed) while a major negative effect
is readily observed for the ASD corpus.

Condition-oriented classification: results and discussion To motivate our next set of
experiments, consider a classifier which distinguishes between medical condition-related and other
tweets. Let the classifier correctly classify a tweet of the former type (dependent variable
value +1) with the probability $p_m$ and of the latter type (dependent variable value -1) with
the probability $p_n$. Imagine applying this classifier on our data set but interpreting its output
as classifying tweets as being ASD related (+1) or not (-1). Since all ASD tweets are
medical condition related, the probability of the classifier output being +1 given an ASD tweet is
also $p_m$. On the other hand, the control corpus contains both tweets not related to a medical
condition as well as those which are related to medical conditions other than ASD. Hence,
the probability of the classifier output being -1 given a tweet from our control corpus is:

$$p_n (1 - \rho_m) + (1 - p_m) \rho_m \qquad (8)$$

where $\rho_m$ is the proportion of control tweets which are medical condition-related. The first
term, $p_n (1 - \rho_m)$, is the probability of a control tweet not related to a medical condition
being selected as input to the classifier (probability $1 - \rho_m$) and correctly classified
(probability $p_n$). The second term, $(1 - p_m) \rho_m$, is the probability of a medical condition
related control tweet being selected as input (probability $\rho_m$) and incorrectly classified
(probability $1 - p_m$), as the classifier output of interest is -1. The expression in (8) can be
rearranged to give:

$$p_n + (1 - p_m - p_n) \rho_m \qquad (9)$$

Since the proportion $\rho_m$ of medical condition related tweets is likely to be small, it can
be seen that a good classifier which differentiates between medical condition related and
other tweets can appear to be a good classifier of ASD vs. non-ASD-related tweets. Thus
the question is whether the results of our experiments described in the previous section really
demonstrate good ASD vs. non-ASD discrimination, or whether our methods simply
learnt to distinguish between medical condition-related and non-medical condition-related
tweets. We set out to investigate this next.
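
A quick numerical check of this argument, under assumed values, is sketched below. The names `acc_med`, `acc_other`, and `frac_med` are hypothetical and stand respectively for the accuracy on medical condition-related tweets, the accuracy on other tweets, and the proportion of control tweets that are medical condition-related; the figures are invented for illustration.

```python
def apparent_control_accuracy(acc_med, acc_other, frac_med):
    """Probability of output -1 on a control tweet, as in (8)."""
    return acc_other * (1 - frac_med) + (1 - acc_med) * frac_med


def rearranged(acc_med, acc_other, frac_med):
    """The same quantity in the rearranged form of (9)."""
    return acc_other + (1 - acc_med - acc_other) * frac_med


# Invented values: a 90%/85% accurate "medical vs. other" classifier and a
# control corpus in which 2% of tweets concern some other medical condition.
acc_med, acc_other, frac_med = 0.90, 0.85, 0.02
a = apparent_control_accuracy(acc_med, acc_other, frac_med)
b = rearranged(acc_med, acc_other, frac_med)
print(f"{a:.3f} {b:.3f}")
```

With so small a proportion of medical condition-related control tweets, the apparent control accuracy stays within about two percentage points of `acc_other`, which is why such a classifier could be mistaken for a good ASD classifier.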

For this experiment we followed the methodology described in the previous set of
experiments with the sole difference being that learning was done using a different control
corpus. Specifically, we collected a data set related to another pervasive medical condition
- Alzheimer’s disease. In a similar manner as in Section 4.1, we collected tweets which
contain the word “alzheimer” (or any of the words derived from it by suffixation), balancing
their number with the number of ASD-related tweets in our data set. Considering the little
difference in performance obtained by the use of different representations in our previous
set of experiments, we did not examine them all here too; instead we adopted the term count
representation as the most commonly used one in the literature.

Our results are summarised in Table 2. It can be readily observed, especially in the
context of the evidence presented previously, that this experiment too strongly supports our
thesis that the content of ASD tweets, even with the most salient words removed (the
keywords used to collect the corpus as described in Section 4.1), is highly informative. Both
Alzheimer’s disease-related and ASD-related tweets were classified with nearly perfect
accuracy. As in our previous experiments there was little difference between the performance
of the naive Bayes classifier and that based on logistic regression, the latter achieving
somewhat better results on ASD-related tweets (100% vs. 98% accuracy).

Table 2: A summary of the classification results (ASD-related vs. Alzheimer’s disease-related
tweets). Each cell shows the associated confusion matrix. The first row/column of a
confusion matrix corresponds to the Alzheimer’s disease class and the second row/column to the
ASD class.



Confusion matrix       Representation
                       Term count

Naive Bayes            0.98 0.02
                       0.05 0.95

Logistic regression    0.98 0.02
                       0.00 1.00


4.6 Bootstrap keyword set analysis 

Recall from Section 4.1 that we collected the ASD tweet corpus by including in it any tweet
which contained any of the four keywords “autism”, “adhd”, “asperger”, and “aspie” (or
any of their derivatives obtained by suffixation). Moreover, in all experiments in which the
difference between tweets related and unrelated to ASD was examined, the ground truth was
defined by the presence of these words. This was necessitated by the scale of the data which
made it practically impossible to perform labelling manually. Consequently, these keywords
had to be excluded from consideration in classification experiments. Clearly this would
not be done in any practical application, which means that the results we presented are worse,
likely significantly so, than they would have been had we been able to use the entirety of
tweet content. The high classification accuracy we obtained of approximately 80% provides
strong evidence that even with these most salient words removed, the remaining content of
ASD-related tweets is highly informative and characteristic. In the final set of experiments
we report here we examined the sensitivity of learning to the exact keywords used to select
the learning corpus.


We repeated the classification experiments from the previous section but using for
training only a part of the previous training corpus - the part that is matched by three of the four
keywords. Thus four experiments were performed, with one of the keywords being left out
in each. We first evaluated the performance of the classifier thus learnt on the test corpus,
just as before. We additionally evaluated the performance of the classifier specifically on
those tweets which contained the left out keyword but none of the other three. Our results
are summarized in the bar plot in Fig. 10. Firstly, observe that in no experiment was the
classification performance on the test set greatly negatively affected - in all cases 83-85%
of control tweets were correctly classified and 71-76% of ASD-related tweets. This is in line
with our previous observations and is highly reassuring since it suggests that the same type
of characteristic context is learnt regardless of which specific keywords are used to collect
training data. However, the classification results on the tweets matched only by the left out
keyword are interesting and provide novel insight. In particular, observe the particularly bad
performance in the experiment in which the keyword “adhd” was left out - only about 44%
of the tweets matched by this keyword were correctly classified. A manual examination of
the misclassified tweets readily revealed the answer - in nearly all cases these seemingly
misclassified tweets indeed were not related to ASD. Rather the term “adhd” was used as
hyperbole for an inattentive or easily distracted person. Here are a few examples:

- Watching football is such a commitment. I’m too adhd for this.

- I think we both have adhd

- I feel like I have adhd.. I cannot write 300 words and not stop for a “break”

- And then your adhd causes you to forget everything as soon as you get on your phone

We made a similar observation with regard to the usage of the word “aspie” which too is used
loosely in colloquial speech (albeit with a lower frequency than “adhd”). This explains the
performance of the classifier trained on data matched by the other three keywords and
evaluated on the tweets matched by “aspie” only, which was also worse than the corresponding
performance in the experiments in which the left out keyword was “autism” or “asperger”
but not as much as when “adhd” was left out.
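
The data split behind each of the four leave-one-keyword-out experiments can be sketched as follows. The function names are hypothetical, and the treatment of tweets matched by both the left out keyword and another keyword (kept for training here) is an assumption of this sketch.

```python
import re

KEYWORDS = ("autism", "adhd", "asperger", "aspie")


def matched_keywords(tweet):
    """Set of keywords (or their suffixed forms) that occur in the tweet."""
    text = tweet.lower()
    return {k for k in KEYWORDS if re.search(r"\b" + k + r"\w*", text)}


def leave_one_out_split(asd_tweets, left_out):
    """Split ASD tweets for one experiment.

    train:    tweets matched by at least one of the three remaining keywords
    held_out: tweets matched by the left out keyword and no other
    """
    train, held_out = [], []
    for tweet in asd_tweets:
        matched = matched_keywords(tweet)
        if matched == {left_out}:
            held_out.append(tweet)
        elif matched - {left_out}:
            train.append(tweet)
    return train, held_out


tweets = ["I think we both have adhd", "autism research update",
          "adhd and autism awareness", "aspies unite"]
train, held_out = leave_one_out_split(tweets, "adhd")
print(len(train), held_out)
```

A low accuracy on `held_out` for some keyword, as observed for “adhd”, then flags that keyword as a candidate for removal from the bootstrap set.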

In summary, this experiment offers several important insights useful in the practical 
application of the ideas presented in the present paper. Firstly, it showed that the choice 
of keywords does not need to be a complex process in the sense that a small number of 
keywords is sufficient to learn the relevant salient characteristics of a theme associated with 
them. Additional keywords appear to be redundant, adding no new information and not 
resulting in improved performance. Secondly, the experiment demonstrated how potentially 
bad keyword choices can be automatically detected and therefore removed. This can be done 
either fully automatically or in a semi-automatic fashion following human input (approval). 

[Figure 10: bar plot of classification accuracy; horizontal axis: left out terms (autism, adhd,
asperger, aspie).]

Fig. 10: Classification results using only a subset of the entire training corpus used in the
previous section. In each experiment training was performed using tweets matched by the three
remaining keywords after one of the original keywords was left out. For each experiment we
show three classification accuracies: on (i) test corpus control tweets (green bars), (ii) test
corpus ASD-related tweets (blue bars), and (iii) tweets matched by the left out keyword but
no others (red bars). See main text for a discussion of the findings.

5 Summary and conclusions

The aim of this work was to investigate the potential of data-mining messages posted on
Twitter to learn about the concerns, practices, and more generally topics of conversation
of people interested in the autism spectrum disorder. We approached this problem first by
harvesting a large data set of tweets each of which we designated as belonging either to
the ASD-related subset of the data corpus or the control subset. Using this data set we
conducted a series of experiments which analysed both common and differential characteristics
of messages in the two subsets and reported a series of novel results. The most important
finding, corroborated by several different experiments, concerns the nature of topics arising
in the ASD subset of tweets. We demonstrated that tweets in this subset are very rich in
information of potentially high value to public health officials and policy makers, thus
motivating further work towards our goal of developing a tool which would be able to monitor
automatically the response of the ASD community to various initiatives, legislation, medical
advances etc. For example, driven by the finding that a high proportion of the tweets in our
corpus contain URL links, one of the directions we wish to pursue in future is that of
exploring the value of auxiliary information which can be associated with original messages.
We also intend to explore automatic ways of labelling topics semantically using meaningful
sentence fragments by back-analysing probabilistically the collected text data for persistent
n-grams across documents with shared topics [51].